An Introduction to text matching

Matching languages can look very cryptic to those unfamiliar with them, but don't worry - the idea is really very simple. Certain characters, often called wildcards or meta characters, are given special meaning. Each of these characters will match parts of the original text only if those parts meet certain conditions. The text that's been matched can then be replaced by something else.

For example, the asterisk "*" will match any unknown group of characters, no matter what they are. It's normally used to match a section of text you're not sure about. For instance, say you were trying to match any word that ended with the letters "ko". Using "*ko" would match "Naoko" or "Atsuko" but it wouldn't match a ko-less "Michie". Likewise, something like "john*smith" would match "John W Smith", "John 'Bubba' Smith", not to mention plain ol' "John Smith".
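To see the asterisk idea outside of the Proxomitron, here is a minimal Python sketch. It is not how the Proxomitron works internally; it just translates "*" into the regular-expression equivalent ".*". The helper name wildcard_match is made up for this example, and the case-insensitive comparison is an assumption taken from the "john*smith" example above.

```python
import re

def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern, where '*' means 'any run of characters'."""
    # Split on '*', escape each literal piece, then glue the pieces back together with '.*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Assumption: matching ignores case, as in the "john*smith" example.
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None

for name in ["Naoko", "Atsuko", "Michie"]:
    print(name, wildcard_match("*ko", name))

for name in ["John W Smith", "John 'Bubba' Smith", "John Smith"]:
    print(name, wildcard_match("john*smith", name))
```

Note that "*" is allowed to match nothing at all, which is why a one-letter-shorter name like "Ko" would still match "*ko".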

Applying the idea to HTML - say you wanted to match all image tags. An image tag always begins with "<img" and ends with a ">", but it can also have any number of things in between. A matching expression like "<img * >" could be used. It's like saying...

Match anything that starts with <img, possibly has some other stuff here, then ends with ">".

In the replacement text you could then re-write the image tag to say exactly what you want it to. It's even possible to "capture" parts of the original text you may want to keep (like the URL of the image for instance) to use in the replacement text. Look at the following...

Matching: <img * src=(\w)\1 * >
Replace: <img src=\1 border=1 >

This introduces a few new ideas. First of all, the "\w" (or word match) will match any continuous string of text unbroken by a space - it's useful for matching URLs. The parentheses "( ... )" followed by the "\1" basically say "Take whatever is matched between ( and ) and place it into variable number one". The "\1" in the replacement text then just inserts the contents of variable number one at that location. The Proxomitron matching language features ten such variables numbered 0-9.

Put into action, the above rule would re-write an image tag that looked like...

<img align=left src="bison.gif" alt="My pet bison Phil" >

Into...

<img src="bison.gif" border=1 >

The part in blue (align=left) is what the first "*" matched.
The part in red ("bison.gif") is what the "(\w)\1" matched.
The part in green (alt="My pet bison Phil") is what the last "*" matched.
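The same rule can be roughly imitated with a Python regular expression, which makes it easy to print out exactly what each piece matched. This is only a sketch of the idea, not the Proxomitron's own syntax: the group names before, url and after are invented for this example, and "a run of characters with no whitespace and no >" is used as a stand-in for "\w".

```python
import re

# Rough stand-in for the rule:  <img * src=(\w)\1 * >
IMG_RULE = re.compile(
    r'<img\s+(?P<before>[^>]*?)\s*'   # first "*": anything before src=
    r'src=(?P<url>[^\s>]+)\s*'        # "(\w)\1": the URL, captured for later
    r'(?P<after>[^>]*?)\s*>',         # last "*": anything up to the closing ">"
    re.IGNORECASE,
)

tag = '<img align=left src="bison.gif" alt="My pet bison Phil" >'
match = IMG_RULE.search(tag)

print("first *  :", match.group("before"))
print("captured :", match.group("url"))
print("last *   :", match.group("after"))

# Only the captured piece is reused in the replacement text.
rewritten = IMG_RULE.sub(r'<img src=\g<url> border=1 >', tag)
print(rewritten)
```

Because the URL is matched as an unbroken run of non-space characters, this sketch would cut short a quoted URL that contains a space.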

Notice that the blue and green bits never appear in the replacement text. Only the bit we decided to keep by using the number variable does. By deciding what to keep and what to throw away we can completely rework a bit of HTML. For example, say we wanted to change the above image so that instead of showing us our bison, it gave us a link we could click to see it. If we changed the replacement text to read...

Replace: <a href=\1 > a picture </a>

Then the same image tag of...

<img align=left src="bison.gif" alt="My pet bison Phil" >

would now become...

<a href="bison.gif" > a picture </a>
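For comparison, the link-making version of the rule might look like this in Python. Again this is just an approximation under the same assumptions as before: the function name images_to_links and the sample page are made up for illustration, and the regular expression only stands in for the Proxomitron matching expression.

```python
import re

# Rough stand-in for the rule:  <img * src=(\w)\1 * >
IMG_RULE = re.compile(r'<img\s+[^>]*?\s*src=([^\s>]+)\s*[^>]*?\s*>', re.IGNORECASE)

def images_to_links(html: str) -> str:
    """Replace every image tag with a link to the image it would have shown."""
    # \1 inserts whatever the parentheses captured, i.e. the image URL.
    return IMG_RULE.sub(r'<a href=\1 > a picture </a>', html)

page = '<p>Meet Phil: <img align=left src="bison.gif" alt="My pet bison Phil" ></p>'
print(images_to_links(page))
```

Everything outside the matched tag is left alone, so the surrounding paragraph survives the rewrite untouched.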

Hopefully this gives you a basic idea of how pattern matching works. Read about the other meta characters to learn more about putting these ideas into action.
